Polynomial Regression
Example: QUADRATIC MODEL FOR PREDICTING THE US POPULATION
> setwd("C:/Users/baron/624 Data Analysis/data")
> read.csv("USpop.csv")
Year Population
1 1790 3.9
2 1800 5.3
3 1810 7.2
4 1820 9.6
5 1830 12.9
6 1840 17.1
7 1850 23.2
8 1860 31.4
9 1870 38.6
10 1880 50.2
11 1890 63.0
12 1900 76.2
13 1910 92.2
14 1920 106.0
15 1930 123.2
16 1940 132.2
17 1950 151.3
18 1960 179.3
19 1970 203.3
20 1980 226.5
21 1990 248.7
22 2000 281.4
23 2010 308.7
> Data = read.csv("USpop.csv")
> names(Data)
[1] "Year"       "Population"
> attach(Data)
> plot(Year,Population)
> # LINEAR MODEL
> lin = lm(Population ~ Year)
> summary(lin)
Call:
lm(formula = Population ~ Year)
Residuals:
    Min      1Q  Median      3Q     Max
-27.774 -24.872  -6.295  18.374  55.087
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.481e+03  1.672e+02  -14.84 1.33e-12 ***
Year         1.360e+00  8.794e-02   15.47 5.93e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 27.97 on 21 degrees of freedom
Multiple R-squared:  0.9193,	Adjusted R-squared:  0.9155
F-statistic: 239.3 on 1 and 21 DF, p-value: 5.927e-13
> abline(lin,col="red",lwd=3)
# Clearly, the linear model is too inflexible and restrictive; it does not
# provide a good fit. This is underfitting. Notice, however, that R-squared
# in this regression is 0.9193. Without looking at the plot, we could have
# assumed that the model is very good!
> predict(lin,data.frame(Year=2020))
1
267.2166
# This is obviously a poor prediction. The US population was already
# 308 million at the most recent Census. So we probably omitted an important
# predictor. Residual plots will help us determine which one.
# Let's produce various related plots. Partition the graphics window into
# 4 parts and use "plot".
> par(mfrow=c(2,2))
> plot(lin)
# The first plot shows that a quadratic term has been omitted, although it is
# important in the population growth. So, fit a quadratic model. The function
# I(...) means "inhibit interpretation"; it forces R to treat (...) literally,
# as Year squared.
> quadr = lm(Population ~ Year + I(Year^2))
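# As a side note (a sketch, not part of the original session): the same
# quadratic model can also be fit with poly(..., raw = TRUE), which avoids the
# I() wrapper. The data are typed in below so the snippet is self-contained.

```r
# US population data, as in USpop.csv above
Year <- seq(1790, 2010, by = 10)
Population <- c(3.9, 5.3, 7.2, 9.6, 12.9, 17.1, 23.2, 31.4, 38.6, 50.2,
                63.0, 76.2, 92.2, 106.0, 123.2, 132.2, 151.3, 179.3,
                203.3, 226.5, 248.7, 281.4, 308.7)

quadr  <- lm(Population ~ Year + I(Year^2))          # as in the session
quadr2 <- lm(Population ~ poly(Year, 2, raw = TRUE)) # equivalent fit

# Both parameterizations produce identical fitted values
all.equal(unname(fitted(quadr)), unname(fitted(quadr2)))
```

With raw = TRUE, poly() uses the ordinary powers Year and Year^2, so the coefficients match the I() formulation exactly; the default raw = FALSE would use orthogonal polynomials instead, changing the coefficients but not the fit.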
> par(mfrow=c(1,1))
> plot(Year,Population)
# Now let's plot the fitted curve. Obtain fitted values.
> Yhat = fitted.values(quadr)
> Yhat
         1          2          3          4          5          6
  6.490870   5.870040   6.603913   8.692490  12.135771  16.933755
         7          8          9         10         11         12
 23.086443  30.593834  39.455929  49.672727  61.244229  74.170435
        13         14         15         16         17         18
 88.451344 104.086957 121.077273 139.422292 159.122016 180.176443
        19         20         21         22         23
202.585573 226.349407 251.467945 277.941186 305.769130
> lines(Year,Yhat,col="blue",lwd=3)
# This looks like a very good fit! Only during WWII did the population grow at
# a rate slightly below the quadratic trend, and it quickly caught up during
# the Baby Boom.
> summary(quadr)
Call:
lm(formula = Population ~ Year + I(Year^2))
Residuals:
    Min      1Q  Median      3Q     Max
-7.8220 -0.7130  0.5961  1.8344  3.7487
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.194e+04  5.795e+02   37.87   <2e-16 ***
Year        -2.438e+01  6.105e-01  -39.94   <2e-16 ***
I(Year^2)    6.774e-03  1.606e-04   42.17   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.023 on 20 degrees of freedom
Multiple R-squared:  0.9991,	Adjusted R-squared:  0.999
F-statistic: 1.113e+04 on 2 and 20 DF, p-value: < 2.2e-16
# A higher R-squared is not surprising: it always increases when we add new
# variables to the model. The fair criterion for comparing models with
# different numbers of parameters is Adjusted R-squared. The quadratic model
# has Adjusted R-squared = 0.999, compared with 0.9155 for the linear model.
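# As a quick check (a sketch using the figures reported in the two summaries
# above), Adjusted R-squared can be computed by hand from Multiple R-squared:
# adj.R2 = 1 - (1 - R2)(n - 1)/(n - p - 1), with n observations and p predictors.

```r
n <- 23  # observations: Census years 1790-2010

# linear model: p = 1 predictor, Multiple R-squared = 0.9193
adj.lin   <- 1 - (1 - 0.9193) * (n - 1) / (n - 1 - 1)

# quadratic model: p = 2 predictors, Multiple R-squared = 0.9991
adj.quadr <- 1 - (1 - 0.9991) * (n - 1) / (n - 2 - 1)

round(adj.lin, 4)    # 0.9155, as in the linear summary
round(adj.quadr, 3)  # 0.999,  as in the quadratic summary
```

The penalty factor (n - 1)/(n - p - 1) grows with p, which is why Adjusted R-squared can decrease when a useless predictor is added, while plain R-squared never does.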
> predict(quadr,data.frame(Year=2020))
1
334.9518
# Now, this is a reasonable prediction for the year 2020.
> predict(quadr,data.frame(Year=2020),interval="confidence")
       fit      lwr      upr
1 334.9518 330.6373 339.2663
> predict(quadr,data.frame(Year=2020),interval="prediction")
       fit     lwr      upr
1 334.9518 327.311 342.5925
# Are the confidence and prediction intervals valid here? They depend on the
# regression assumptions, which we'll discuss next time.